dplyr package.dplyr function select.dplyr function
filter.dplyr function
to the input of another function with the ‘pipe’ operator
%>%.mutate.summarize, group_by, and
count to split a dataframe into groups of observations,
apply a summary statistics for each group, and then combine the
results.dplyr is a package for making tabular
data wrangling easier by using a limited set of functions that can be
combined to extract and summarize insights from your data.
Like readr,
dplyr is a part of the tidyverse. These
packages were loaded in R’s memory when we called
library(tidyverse) earlier.
The packages in the tidyverse, namely
dplyr, tidyr
and ggplot2 accept both the British
(e.g. summarise) and American (e.g. summarize)
spelling variants of different function and option names. For this
lesson, we utilize the American spellings of different functions;
however, feel free to use the regional variant for where you are
teaching.
The package dplyr provides easy tools
for the most common data wrangling tasks. It is built to work directly
with dataframes, with many common tasks optimized by being written in a
compiled language (C++) (not all R packages are written in R!).
There are also packages available for a wide range of tasks including
building plots (ggplot2, which we’ll see
later), downloading data from the NCBI database, or performing
statistical analysis on your data set. Many packages such as these are
housed on, and downloadable from, the Comprehensive
R Archive Network
(CRAN) using install.packages. This function makes the
package accessible by your R installation with the command
library(), as you did with tidyverse
earlier.
To easily access the documentation for a package within R or RStudio,
use help(package = "package_name").
To learn more about dplyr after the
workshop, you may want to check out this handy
data transformation with dplyr
cheatsheet.
There are alternatives to the tidyverse packages for
data wrangling, including the package data.table.
See this comparison
for example to get a sense of the differences between using
base, tidyverse, and
data.table.
dplyrTo make sure everyone will use the same dataset for this lesson, we’ll read again the kings dataset that we downloaded earlier.
## load the tidyverse
library(tidyverse)
## load the data
kings <- read_csv("data/kings.csv", na = "NULL")
## inspect the data
kings
## preview the data
# view(kings)We’re going to learn some of the most common
dplyr functions:
select(): subset columnsfilter(): subset rows on conditionsmutate(): create new columns by using information from
other columnsgroup_by() and summarize(): create summary
statistics on grouped dataarrange(): sort resultscount(): count discrete valuesTo select columns of a dataframe, use select(). The
first argument to this function is the dataframe (kings),
and the subsequent arguments are the columns to keep, separated by
commas. Alternatively, if you are selecting columns adjacent to each
other, you can use a : to select a range of columns, read
as “select columns from ___ to ___.” You may have done something similar
in the past using subsetting. select() is essentially doing
the same thing as subsetting, using a package (dplyr)
instead of R’s base functions.
# to select columns throughout the dataframe
select(kings, Name, Start_year, End_year)
# to do the same thing with subsetting
kings[c("Name", "Start_year", "End_year")]
# to select a series of connected columns
select(kings, Name:End_year)To choose rows based on specific criteria, we can use the
filter() function. The argument after the dataframe is the
condition we want our final dataframe to adhere to (e.g. House is
Gorm):
# filters observations where House is "Gorm"
filter(kings, House == "Gorm")# A tibble: 7 × 11
Name House Start…¹ End_y…² Birth…³ Death…⁴ Gender Dynasty Source Birth…⁵ Death…⁶
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
1 Gorm den Ga… Gorm 936 958 908 958 M Jellin… https… "908" "958"
2 Harald 1. B… Gorm NA NA 936 985 M Jellin… https… "936" "987"
3 Toke_Gormsen Gorm 985 986 NA 986 M Jellin… https… "" ""
4 Svend 1. Tv… Gorm NA NA NA 1014 M Jellin… https… "17/04… "03/02…
5 Harald 2. Gorm 1014 1018 963 NA M Jellin… https… "994" "1018"
6 Knud 1. den… Gorm 1018 1035 NA 1035 M Jellin… https… "995" "12/11…
7 Hardeknud Gorm 1035 1042 995 1042 M Jellin… https… "1018" "08/06…
# … with abbreviated variable names ¹Start_year, ²End_year, ³Birth_year, ⁴Death_year,
# ⁵BirthDMY, ⁶DeathDMY
We can also specify multiple conditions within the
filter() function. We can combine conditions using either
“and” or “or” statements. In an “and” statement, an observation (row)
must meet every criteria to be included in the
resulting dataframe. To form “and” statements within dplyr, we can pass
our desired conditions as arguments in the filter()
function, separated by commas:
# filters observations with "and" operator (comma)
# output dataframe satisfies ALL specified conditions
filter(kings, House == "Estridsen",
End_year > 1100,
End_year < 1210)# A tibble: 7 × 11
Name House Start…¹ End_y…² Birth…³ Death…⁴ Gender Dynasty Source Birth…⁵ Death…⁶
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
1 Erik 1. Eje… Estr… 1095 1103 1050 1103 M Jellin… "http… "1055" "10/07…
2 Niels Estr… 1104 1134 1055 1134 M Jellin… "http… "1134" "25/06…
3 Erik 2. Emu… Estr… 1134 1137 1065 1137 M Jellin… "http… "1100" "18/09…
4 Erik 3. Lam Estr… 1137 1146 1100 1146 M Jellin… "http… "1120" "27/08…
5 Svend 3.; K… Estr… 1146 1157 1120 NA M Jellin… "" "" ""
6 Valdemar 1.… Estr… 1157 1182 1131 1182 M Jellin… "http… "14/01… "12/05…
7 Knud 4. Estr… 1182 1202 1163 1202 M Jellin… "http… "1163" "12/11…
# … with abbreviated variable names ¹Start_year, ²End_year, ³Birth_year, ⁴Death_year,
# ⁵BirthDMY, ⁶DeathDMY
We can also form “and” statements with the &
operator instead of commas:
# filters observations with "&" logical operator
# output dataframe satisfies ALL specified conditions
filter(kings, House == "Estridsen" &
End_year > 1100 &
End_year < 1210)# A tibble: 7 × 11
Name House Start…¹ End_y…² Birth…³ Death…⁴ Gender Dynasty Source Birth…⁵ Death…⁶
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
1 Erik 1. Eje… Estr… 1095 1103 1050 1103 M Jellin… "http… "1055" "10/07…
2 Niels Estr… 1104 1134 1055 1134 M Jellin… "http… "1134" "25/06…
3 Erik 2. Emu… Estr… 1134 1137 1065 1137 M Jellin… "http… "1100" "18/09…
4 Erik 3. Lam Estr… 1137 1146 1100 1146 M Jellin… "http… "1120" "27/08…
5 Svend 3.; K… Estr… 1146 1157 1120 NA M Jellin… "" "" ""
6 Valdemar 1.… Estr… 1157 1182 1131 1182 M Jellin… "http… "14/01… "12/05…
7 Knud 4. Estr… 1182 1202 1163 1202 M Jellin… "http… "1163" "12/11…
# … with abbreviated variable names ¹Start_year, ²End_year, ³Birth_year, ⁴Death_year,
# ⁵BirthDMY, ⁶DeathDMY
In an “or” statement, observations must meet at least one of the specified conditions. To form “or” statements we use the logical operator for “or,” which is the vertical bar (|):
# filters observations with "|" logical operator
# output dataframe satisfies AT LEAST ONE of the specified conditions
filter(kings, House == "Estridsen" | End_year < 1100)# A tibble: 28 × 11
Name House Start…¹ End_y…² Birth…³ Death…⁴ Gender Dynasty Source Birth…⁵ Death…⁶
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
1 Gorm den G… Gorm 936 958 908 958 M "Jelli… https… "908" "958"
2 Toke_Gorms… Gorm 985 986 NA 986 M "Jelli… https… "" ""
3 Harald 2. Gorm 1014 1018 963 NA M "Jelli… https… "994" "1018"
4 Knud 1. de… Gorm 1018 1035 NA 1035 M "Jelli… https… "995" "12/11…
5 Hardeknud Gorm 1035 1042 995 1042 M "Jelli… https… "1018" "08/06…
6 Magnus den… Fair… 1042 1047 1018 1047 M "Jelli… https… "1024" "25/10…
7 Svend 2. E… Estr… 1047 1074 1024 1076 M "" https… "1019" "28/04…
8 Harald 3. … Estr… 1074 1080 1019 1080 M "Jelli… https… "1041" "17/04…
9 Knud 2. de… Estr… 1080 1086 1041 1086 M "Jelli… https… "1043" "10/07…
10 Oluf 1. Hu… Estr… 1086 1095 1043 1095 M "Jelli… https… "1050" "18/08…
# … with 18 more rows, and abbreviated variable names ¹Start_year, ²End_year,
# ³Birth_year, ⁴Death_year, ⁵BirthDMY, ⁶DeathDMY
What if you want to select and filter at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes.
With intermediate steps, you create a temporary dataframe and use that as input to the next function, like this:
kings_male <- filter(kings, Gender == "M")
kings_new <- select(kings_male, Name:Death_year)
kings_new# A tibble: 52 × 6
Name House Start_year End_year Birth_year Death_year
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Gorm den Gamle Gorm 936 958 908 958
2 Harald 1. Blåtand Gorm NA NA 936 985
3 Toke_Gormsen Gorm 985 986 NA 986
4 Svend 1. Tveskæg Gorm NA NA NA 1014
5 Harald 2. Gorm 1014 1018 963 NA
6 Knud 1. den Store Gorm 1018 1035 NA 1035
7 Hardeknud Gorm 1035 1042 995 1042
8 Magnus den Gode Fairhair 1042 1047 1018 1047
9 Svend 2. Estridsen Estridsen 1047 1074 1024 1076
10 Harald 3. Hen Estridsen 1074 1080 1019 1080
# … with 42 more rows
This is readable, but can clutter up your workspace with lots of objects that you have to name individually. With multiple steps, that can be hard to keep track of.
You can also nest functions (i.e. one function inside of another), like this:
kings_new <- select(filter(kings, House == "Estridsen"),
Name:End_year)This is handy, but can be difficult to read if too many functions are nested, as R evaluates the expression from the inside out (in this case, filtering, then selecting).
The last option, pipes, are a recent addition to R. Pipes
let you take the output of one function and send it directly to the
next, which is useful when you need to do many things to the same
dataset. Pipes in R look like %>% and are made available
via the magrittr package, installed
automatically with dplyr. If you use
RStudio, you can type the pipe with:
kings %>%
filter(House == "Gorm") %>%
select(Name:End_year)# A tibble: 7 × 4
Name House Start_year End_year
<chr> <chr> <dbl> <dbl>
1 Gorm den Gamle Gorm 936 958
2 Harald 1. Blåtand Gorm NA NA
3 Toke_Gormsen Gorm 985 986
4 Svend 1. Tveskæg Gorm NA NA
5 Harald 2. Gorm 1014 1018
6 Knud 1. den Store Gorm 1018 1035
7 Hardeknud Gorm 1035 1042
In the above code, we use the pipe to send the kings
dataset first through filter() to keep rows where
House is “Gorm”, then through select() to keep
only the columns from Name to End_year. Since
%>% takes the object on its left and passes it as the
first argument to the function on its right, we don’t need to explicitly
include the dataframe as an argument to the filter() and
select() functions any more.
Some may find it helpful to read the pipe like the word “then”. For
instance, in the above example, we take the dataframe
kings, then we filter for rows with
House == "Gorm", then we select
columns Name:End_year. The
dplyr functions by themselves are somewhat
simple, but by combining them into linear workflows with the pipe, we
can accomplish more complex data wrangling operations.
If we want to create a new object with this smaller version of the data, we can assign it a new name:
kings_gorm <- kings %>%
filter(House == "Gorm") %>%
select(Name:End_year)
kings_gorm# A tibble: 7 × 4
Name House Start_year End_year
<chr> <chr> <dbl> <dbl>
1 Gorm den Gamle Gorm 936 958
2 Harald 1. Blåtand Gorm NA NA
3 Toke_Gormsen Gorm 985 986
4 Svend 1. Tveskæg Gorm NA NA
5 Harald 2. Gorm 1014 1018
6 Knud 1. den Store Gorm 1018 1035
7 Hardeknud Gorm 1035 1042
Note that the final dataframe (kings_gorm) is the
leftmost part of this expression.
Using pipes, subset the kings data to include kings with
non-missing reign dates, who were born in the 14th century
(Birth_year > 1300 & Birth_year <
1399) and retain only the columns Name, House,
Start_year and End_year.
kings %>%
filter(Birth_year > 1300 & Birth_year < 1399) %>%
select(Name, House, Start_year, End_year) # A tibble: 5 × 4
Name House Start_year End_year
<chr> <chr> <dbl> <dbl>
1 Valdemar 3. Estridsen 1326 1329
2 Valdemar 4. Atterdag Estridsen 1340 1375
3 Oluf 2. Bjelbo 1375 1387
4 Margrete 1. Estridsen 1387 1396
5 Erik 7. af Pommern Pomerania 1396 1439
Frequently you’ll want to create new columns based on the values in
existing columns, for example to do unit conversions, or to find the
ratio of values in two columns. For this we’ll use
mutate().
We might be interested in the number of years each king lived, reigned, and what the mid-year of each ruler’s reign was (such as you will need to plot the rulers on a timeline):
kings %>%
mutate(Reign_duration = End_year - Start_year)# A tibble: 57 × 12
Name House Start…¹ End_y…² Birth…³ Death…⁴ Gender Dynasty Source Birth…⁵ Death…⁶
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
1 Gorm den G… Gorm 936 958 908 958 M "Jelli… https… "908" "958"
2 Harald 1. … Gorm NA NA 936 985 M "Jelli… https… "936" "987"
3 Toke_Gorms… Gorm 985 986 NA 986 M "Jelli… https… "" ""
4 Svend 1. T… Gorm NA NA NA 1014 M "Jelli… https… "17/04… "03/02…
5 Harald 2. Gorm 1014 1018 963 NA M "Jelli… https… "994" "1018"
6 Knud 1. de… Gorm 1018 1035 NA 1035 M "Jelli… https… "995" "12/11…
7 Hardeknud Gorm 1035 1042 995 1042 M "Jelli… https… "1018" "08/06…
8 Magnus den… Fair… 1042 1047 1018 1047 M "Jelli… https… "1024" "25/10…
9 Svend 2. E… Estr… 1047 1074 1024 1076 M "" https… "1019" "28/04…
10 Harald 3. … Estr… 1074 1080 1019 1080 M "Jelli… https… "1041" "17/04…
# … with 47 more rows, 1 more variable: Reign_duration <dbl>, and abbreviated variable
# names ¹Start_year, ²End_year, ³Birth_year, ⁴Death_year, ⁵BirthDMY, ⁶DeathDMY
We may be interested in investigating whether being a member of a specific House had any effect on the duration of reign. To look at this relationship, we will first remove data from our dataset where the dates are missing. These cases are recorded as “NULL” in the dataset.
To remove these cases, we could insert a filter() in the
chain:
kings %>%
filter(!is.na(Start_year)) %>%
mutate(Reign_duration = End_year - Start_year) %>%
mutate(Midyear = End_year - (End_year-Start_year) / 2 )# A tibble: 52 × 13
Name House Start…¹ End_y…² Birth…³ Death…⁴ Gender Dynasty Source Birth…⁵ Death…⁶
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
1 Gorm den G… Gorm 936 958 908 958 M "Jelli… https… "908" "958"
2 Toke_Gorms… Gorm 985 986 NA 986 M "Jelli… https… "" ""
3 Harald 2. Gorm 1014 1018 963 NA M "Jelli… https… "994" "1018"
4 Knud 1. de… Gorm 1018 1035 NA 1035 M "Jelli… https… "995" "12/11…
5 Hardeknud Gorm 1035 1042 995 1042 M "Jelli… https… "1018" "08/06…
6 Magnus den… Fair… 1042 1047 1018 1047 M "Jelli… https… "1024" "25/10…
7 Svend 2. E… Estr… 1047 1074 1024 1076 M "" https… "1019" "28/04…
8 Harald 3. … Estr… 1074 1080 1019 1080 M "Jelli… https… "1041" "17/04…
9 Knud 2. de… Estr… 1080 1086 1041 1086 M "Jelli… https… "1043" "10/07…
10 Oluf 1. Hu… Estr… 1086 1095 1043 1095 M "Jelli… https… "1050" "18/08…
# … with 42 more rows, 2 more variables: Reign_duration <dbl>, Midyear <dbl>, and
# abbreviated variable names ¹Start_year, ²End_year, ³Birth_year, ⁴Death_year,
# ⁵BirthDMY, ⁶DeathDMY
The ! symbol negates the result of the
is.na() function. Thus, if is.na() returns a
value of TRUE (because the Start_year is
missing), the ! symbol negates this and says we only want
values of FALSE, where Start_year is
not missing.
##@ Exercise 2
Load the kings data and create a new dataframe that
meets the following criteria: contains the Name and
House column and three new columns called
Years_lived,Reign_duration, and
Midyear containing values that encompass the monarch’s
life, duration of reign and the middle year of their reign (end of rule
year - duration/2). Only the rows where Reign_duration is
greater than 25 should be shown in the final dataframe.
Hint: think about how the commands should be ordered to produce this data frame!
kings <- kings %>%
mutate(Years_lived = Death_year - Start_year) %>%
mutate(Reign_duration = End_year - Start_year) %>%
mutate(Midyear = End_year - (End_year-Start_year) / 2 ) %>%
select(Name, House, Gender, Years_lived, Reign_duration, Midyear)Many data analysis tasks can be approached using the
split-apply-combine paradigm: split the data into groups, apply
some analysis to each group, and then combine the results.
dplyr makes this very easy through the use
of the group_by() function.
summarize() functiongroup_by() is often used together with
summarize(), which collapses each group into a single-row
summary of that group. group_by() takes as arguments the
column names that contain the categorical variables for
which you want to calculate the summary statistics. So to compute the
average ………:
kings %>%
group_by(House) %>%
summarize(mean_rule = mean(Reign_duration))# A tibble: 9 × 2
House mean_rule
<chr> <dbl>
1 "" NA
2 "Bjelbo" 12
3 "Estridsen" 14.9
4 "Fairhair" 5
5 "Gorm" NA
6 "Oldenburg" 25.7
7 "Palatinate-Neumarkt" 8
8 "Pomerania" 43
9 "Schleswig-Holstein-Sonderburg-Glücksborg" 32
You may also have noticed that the output from these calls doesn’t
run off the screen anymore. It’s one of the advantages of
tbl_df over dataframe.
You can also group by multiple columns:
kings %>%
group_by(House, Gender) %>%
summarize(mean_rule = mean(Reign_duration))# A tibble: 11 × 3
# Groups: House [9]
House Gender mean_rule
<chr> <chr> <dbl>
1 "" "" NA
2 "Bjelbo" "M" 12
3 "Estridsen" "F" 9
4 "Estridsen" "M" 15.2
5 "Fairhair" "M" 5
6 "Gorm" "M" NA
7 "Oldenburg" "M" 25.7
8 "Palatinate-Neumarkt" "M" 8
9 "Pomerania" "M" 43
10 "Schleswig-Holstein-Sonderburg-Glücksborg" "F" 51
11 "Schleswig-Holstein-Sonderburg-Glücksborg" "M" 27.2
When grouping both by House and Gender, we
see rows in our table for monarchs of different gender.
kings %>%
group_by(House) %>%
summarize(mean_years = mean(Years_lived))# A tibble: 9 × 2
House mean_years
<chr> <dbl>
1 "" NA
2 "Bjelbo" 12
3 "Estridsen" NA
4 "Fairhair" 5
5 "Gorm" NA
6 "Oldenburg" 28.1
7 "Palatinate-Neumarkt" 8
8 "Pomerania" 63
9 "Schleswig-Holstein-Sonderburg-Glücksborg" NA
Notice that there are a number of NAs in the result. We can exclude those data from our table using a filter step.
kings %>%
filter(!is.na(Years_lived)) %>%
group_by(House) %>%
summarize(mean_years = mean(Years_lived))# A tibble: 8 × 2
House mean_years
<chr> <dbl>
1 Bjelbo 12
2 Estridsen 17.9
3 Fairhair 5
4 Gorm 11.8
5 Oldenburg 28.1
6 Palatinate-Neumarkt 8
7 Pomerania 63
8 Schleswig-Holstein-Sonderburg-Glücksborg 27.2
kings %>%
filter(!is.na(Years_lived)) %>%
group_by(House) %>%
summarize(mean_years = round(mean(Years_lived),1))# A tibble: 8 × 2
House mean_years
<chr> <dbl>
1 Bjelbo 12
2 Estridsen 17.9
3 Fairhair 5
4 Gorm 11.8
5 Oldenburg 28.1
6 Palatinate-Neumarkt 8
7 Pomerania 63
8 Schleswig-Holstein-Sonderburg-Glücksborg 27.2
Once the data are grouped, you can also summarize multiple variables at the same time (and not necessarily on the same variable). For instance, we could add a column indicating the minimum and maximum years ruled for each House for each group (men or women):
kings %>%
filter(!is.na(Reign_duration)) %>%
group_by(House, Gender) %>%
summarize(max_rule = max(Reign_duration),
min_rule = min(Reign_duration))# A tibble: 10 × 4
# Groups: House [8]
House Gender max_rule min_rule
<chr> <chr> <dbl> <dbl>
1 Bjelbo M 12 12
2 Estridsen F 9 9
3 Estridsen M 39 2
4 Fairhair M 5 5
5 Gorm M 22 1
6 Oldenburg M 60 9
7 Palatinate-Neumarkt M 8 8
8 Pomerania M 43 43
9 Schleswig-Holstein-Sonderburg-Glücksborg F 51 51
10 Schleswig-Holstein-Sonderburg-Glücksborg M 43 6
It is sometimes useful to rearrange the result of a query to inspect
the values. For instance, we can sort on min_membrs to put
the group with the shortest reign first:
kings %>%
filter(!is.na(Reign_duration)) %>%
group_by(House, Gender) %>%
summarize(max_rule = max(Reign_duration),
min_rule = min(Reign_duration)) %>%
arrange(min_rule)# A tibble: 10 × 4
# Groups: House [8]
House Gender max_rule min_rule
<chr> <chr> <dbl> <dbl>
1 Gorm M 22 1
2 Estridsen M 39 2
3 Fairhair M 5 5
4 Schleswig-Holstein-Sonderburg-Glücksborg M 43 6
5 Palatinate-Neumarkt M 8 8
6 Estridsen F 9 9
7 Oldenburg M 60 9
8 Bjelbo M 12 12
9 Pomerania M 43 43
10 Schleswig-Holstein-Sonderburg-Glücksborg F 51 51
To sort in descending order, we need to add the desc()
function. If we want to sort the results by decreasing order of minimum
household size:
kings %>%
filter(!is.na(Reign_duration)) %>%
group_by(House, Gender) %>%
summarize(max_rule = max(Reign_duration),
min_rule = min(Reign_duration)) %>%
arrange(desc(min_rule))# A tibble: 10 × 4
# Groups: House [8]
House Gender max_rule min_rule
<chr> <chr> <dbl> <dbl>
1 Schleswig-Holstein-Sonderburg-Glücksborg F 51 51
2 Pomerania M 43 43
3 Bjelbo M 12 12
4 Estridsen F 9 9
5 Oldenburg M 60 9
6 Palatinate-Neumarkt M 8 8
7 Schleswig-Holstein-Sonderburg-Glücksborg M 43 6
8 Fairhair M 5 5
9 Estridsen M 39 2
10 Gorm M 22 1
When working with data, we often want to know the number of
observations found for each factor or combination of factors. For this
task, dplyr provides count().
For example, if we wanted to count the number of rows of data for each
House, we would do:
kings %>%
count(House)# A tibble: 9 × 2
House n
<chr> <int>
1 "" 3
2 "Bjelbo" 1
3 "Estridsen" 22
4 "Fairhair" 1
5 "Gorm" 7
6 "Oldenburg" 16
7 "Palatinate-Neumarkt" 1
8 "Pomerania" 1
9 "Schleswig-Holstein-Sonderburg-Glücksborg" 5
For convenience, count() provides the sort
argument to get results in decreasing order:
kings %>%
count(House, sort = TRUE)# A tibble: 9 × 2
House n
<chr> <int>
1 "Estridsen" 22
2 "Oldenburg" 16
3 "Gorm" 7
4 "Schleswig-Holstein-Sonderburg-Glücksborg" 5
5 "" 3
6 "Bjelbo" 1
7 "Fairhair" 1
8 "Palatinate-Neumarkt" 1
9 "Pomerania" 1
Now that you have nicely enriched the kings, we can save this
dataframe to our data_output directory.
write_csv(kings, file = "data_output/kings_plotting.csv")::::::::::::::::::::::::::::::::::::::: challenge
Suggested steps:
Select the Names and Reign_duration
columns with select() filter() the kings
dataset Reign_duration > the-average-length that you
calculated before and count the resulting rows with
count()
# for the average years of rule
kings %>%
filter(Reign_duration > 29.5) %>%
count()# A tibble: 1 × 1
n
<int>
1 14
Use mutate() to create a new column
AboveAverage and populate it with TRUE/FALSE which will be
a response to the question of whether the Reign_duration
value is bigger or smaller than average-length Group the unique
values in the AboveAverage column with
group_by() and then count(). What is your
result?
kings %>%
mutate(AboveAverage = Reign_duration > 29) %>%
group_by(AboveAverage) %>%
count()# A tibble: 3 × 2
# Groups: AboveAverage [3]
AboveAverage n
<lgl> <int>
1 FALSE 38
2 TRUE 14
3 NA 5
Sort kings by reign duration in ascending and descending order. Which three ruled the longest/shortest?
How many days did the three longest-ruling kings rule? Use mutate() to create a column called Days calculating the total umber of days they ruled. Consider transition year in your calculation! (366 days)
kings %>%
arrange(Reign_duration)# A tibble: 57 × 6
Name House Gender Years…¹ Reign…² Midyear
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Toke_Gormsen Gorm M 1 1 986.
2 Abel Estridsen M 2 2 1251
3 Erik 2. Emune Estridsen M 3 3 1136.
4 Valdemar 3. Estridsen M 38 3 1328.
5 Christoffer 2. Estridsen M 3 3 1330.
6 Harald 2. Gorm M NA 4 1016
7 Magnus den Gode Fairhair M 5 5 1044.
8 Harald 3. Hen Estridsen M 6 6 1077
9 Knud 2. den Hellige Estridsen M 6 6 1083
10 Frederik 8. Schleswig-Holstein-Sonderburg-Glück… M 6 6 1909
# … with 47 more rows, and abbreviated variable names ¹Years_lived, ²Reign_duration
grepl() or str_detect() functions and a
wild card.kings %>%
filter(str_detect(Name, "Christian*|Fred*"))# A tibble: 18 × 6
Name House Gender Years_l…¹ Reign…² Midyear
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Christian 2. Oldenburg M 46 10 1518
2 Frederik 1. Oldenburg M 10 10 1528
3 Christian 3. Oldenburg M 23 23 1548.
4 Frederik 2. Oldenburg M 29 29 1574.
5 Christian 4. Oldenburg M 60 60 1618
6 Frederik 3. Oldenburg M 22 22 1659
7 Christian 5. Oldenburg M 29 29 1684.
8 Frederik 4. Oldenburg M 34 31 1714.
9 Christian 6. Oldenburg M 16 16 1738
10 Frederik 5. Oldenburg M 20 20 1756
11 Christian 7. Oldenburg M 42 42 1787
12 Frederik 6. Oldenburg M 31 31 1824.
13 Christian 8. Oldenburg M 9 9 1844.
14 Frederik 7. Oldenburg M 15 15 1856.
15 Christian 9. Schleswig-Holstein-Sonderburg-Glücksborg M 43 43 1884.
16 Frederik 8. Schleswig-Holstein-Sonderburg-Glücksborg M 6 6 1909
17 Christian 10. Schleswig-Holstein-Sonderburg-Glücksborg M 35 35 1930.
18 Frederik 9. Schleswig-Holstein-Sonderburg-Glücksborg M 25 25 1960.
# … with abbreviated variable names ¹Years_lived, ²Reign_duration
dplyr package to manipulate dataframes.select() to choose variables from a dataframe.filter() to choose data based on values.group_by() and summarize() to work
with subsets of data.mutate() to create new variables.